Automatic information extraction from semi-structured Web pages by pattern discovery

نویسندگان

  • Chia-Hui Chang
  • Chun-Nan Hsu
  • Shao-Chen Lui
چکیده

The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. Previous work in wrapper induction aims at learning extraction rules from user-labeled training examples, which, however, can be expensive in some practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT-trees, multiple string alignments and pattern matching algorithms. Extractors generated by IEPAD can be generalized over unseen pages from the same Web data source. We empirically evaluate the performance of IEPAD on an information extraction task from 14 real Web data sources. Experimental results show that with the extraction rules discovered from a single page, IEPAD achieves 96% average retrieval rate, and with less than five example pages, IEPAD achieves 100% retrieval rate for 10 of the sample Web data sources. D 2002 Elsevier Science B.V. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Semi-structured Information Extraction Applying Automatic Pattern Discovery

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction aim to solve this problem by applying machine learning to automatically generate extractors. For example, WIEN, Stalker, Softmealy, etc. However, this approach still requires human intervention to provide training examples. He...

متن کامل

Efficient RSS Feed Generation from html Pages

Although RSS demonstrates a promising solution to track and personalize the flow of new Web information, many of the current Web sites are not yet enabled with RSS feeds. The availability of convenient approaches to “RSSify” existing suitable Web contents has become a stringent necessity. This paper presents EHTML2RSS, an efficient system that translates semi-structured HTML pages to structured...

متن کامل

Automatic Discovery of Lexical Patterns using Pattern Extraction Algorithm to Identify Personal Name Aliases with Entities

The personal name aliases are extremely significant in information retrieval to retrieve complete information about a personal name from the web, as some of the web pages of the person may also be referred by his or her alias name / nick name / real name. There is a rapid growth in people searching where the personal name aliases are concerned. We proposed a pattern generator which includes aut...

متن کامل

Automatic Information Extraction for Multiple Singular Web Pages

TheWorld WideWeb is now undeniably the richest and most dense source of information, yet its structure makes it diÆcult to make use of that information in a systematic way. This paper extends a pattern discovery approach called IEPAD to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. IEPAD is proposed to automate wrapper genera...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Decision Support Systems

دوره 35  شماره 

صفحات  -

تاریخ انتشار 2003